Tough Mining

Author

  • Steven Dickman
Abstract

Caenorhabditis elegans, a 1-mm soil-dwelling roundworm with 959 cells, may be the best-understood multicellular organism on the planet. As the most "pared-down" animal that shares essential features of human biology, from embryogenesis to aging, C. elegans is a favorite subject for studying how genes control these processes. The way these genes work in worms helps scientists understand how diseases like cancer and Alzheimer's develop in humans when genes malfunction. With the publication of a draft genome sequence of C. elegans' first cousin, C. briggsae, Lincoln Stein and colleagues have greatly enhanced biologists' ability to mine C. elegans for biological gold.

Every organism carries clues to its molecular operating system and evolutionary past embedded in the content and structure of its genome. To unearth these clues, scientists examine different regions of the genome, assembling data on sequences, genes, functional elements that are not genes (but that regulate them, for example), repeated sequences, and so on. By comparing the genomes of related organisms, researchers can see what parts of the genomes are conserved (highly conserved genes tend to be important) and then focus on these regions to track down genes and determine how they function.

To construct a draft sequence of the C. briggsae genome, the researchers merged genomic data from three sources: one derived from whole-genome shotgun sequencing, another from physical genome mapping, and the third from regions of a previously "finished" sequence. For the shotgun sequence, the researchers extracted DNA from worms, randomly cut it into short pieces, sequenced them, and then assembled overlapping sequences to create thousands of stretches of contiguous DNA sequence. To help fill in the gaps between these "contigs," Stein and colleagues developed a "fingerprint" map of the genome as a guide for aligning the shorter fragments.
The map also helped them identify inconsistencies and misalignments in the genome assembly. Finally, they integrated the previously finished sequence to improve the draft genome sequence. Using these massive datasets, the authors produced a high-quality genome sequence; although it does not quite meet the gold standard of a "finished" sequence, it covers 98% of the genome and has an accuracy of 99.98%.

After confirming the accuracy of the draft, the researchers turned to the substance of the genome. Examining two species side by side, scientists can quickly spot genes and flag interesting regions for further investigation. Analyzing the organization of the two genomes, Stein et al. not only found …
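
The shotgun step described in the abstract (cut DNA into short pieces, sequence them, then join pieces whose ends overlap into contigs) can be illustrated with a toy greedy overlap merge. This is only a minimal sketch under strong assumptions: error-free reads, a single strand, and no repeats. It is not the assembly pipeline used by Stein and colleagues, and the function names `overlap` and `greedy_assemble`, the `min_len` threshold, and the example reads are all hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest match is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Stops when no pair overlaps by at least `min_len` bases; whatever is
    left are the contigs, with unresolved gaps between them.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    # Drop reads fully contained in another read (toy simplification).
    contigs = [r for r in contigs if not any(r != s and r in s for s in contigs)]
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical reads: three overlap into one contig, one stands alone.
reads = ["GCGTACC", "CCCAAAGGG", "ATGGCGT", "TACCTGA"]
print(sorted(greedy_assemble(reads)))  # ['ATGGCGTACCTGA', 'CCCAAAGGG']
```

The two sequences left at the end have no overlap with each other, which is the kind of gap between contigs that an independent physical map, such as the fingerprint map mentioned above, helps to order and close.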

Similar resources

Efficient strategies for tough aggregate constraint-based sequential pattern mining

Frequent sequential pattern mining with constraints is the task of discovering patterns by incorporating user-defined constraints into the mining process, thus not only improving mining efficiency but also making the discovered patterns better meet user requirements. Though many studies have been done, few have been carried out on the “tough aggregate constraints” due to the difficulty...

A solution to tough GNSS land applications using terrestrial-based transceivers (LocataLites)

Acceptable RTK GPS/GNSS performance is heavily dependent on a relatively unobstructed sky-view, where there are at least five satellites with good geometry available, and on the reliability of the wireless data link used for differential corrections. In “tough” GNSS applications where satellite occlusion is common, such as open cast mining, the RTK based technology often fails to deliver the re...

Discovering 'Tough Love' Interventions Despite Dropout

This paper reports an application to educational intervention of Principal Stratification, a statistical method for estimating the effect of a treatment even when there are different rates of dropout in experimental and control conditions. We consider the potential value for using principal stratification to identify “Tough Love Interventions” – interventions that have a large effect but also i...

An Efficient Framework for Mining Flexible Constraints

Constraint-based mining is an active field of research and a key to interactive and successful KDD processes. Nevertheless, usual solvers are limited to particular kinds of constraints because they rely on search-space pruning properties that are mutually incompatible. In this paper, we provide a general framework dedicated to a large set of constraints described by SQL-lik...

Pattern-growth Methods for Frequent Pattern Mining

Mining frequent patterns from large databases plays an essential role in many data mining tasks and has broad applications. Most of the previously proposed methods adopt Apriori-like candidate-generation-and-test approaches. However, those methods may encounter serious challenges when mining datasets with prolific patterns and/or long patterns. In this work, we develop a class of novel and effic...

Mining Frequent Item Sets with Convertible Constraints

Recent work has highlighted the importance of the constraint-based mining paradigm in the context of frequent itemsets, associations, correlations, sequential patterns, and many other interesting patterns in large databases. In this paper, we study constraints which cannot be handled with existing theory and techniques. For example, … (… can contain items of arbitrary values) …, are ...

Journal:
  • PLoS Biology

Volume: 1

Publication date: 2003